Open In Colab

from IPython.display import HTML, display

# set path containing data folder or use default for Colab (/gdrive/My Drive)
local_folder = "../"
import urllib.request
urllib.request.urlretrieve('https://raw.githubusercontent.com/guiwitz/DLImaging/master/utils/check_colab.py', 'check_colab.py')
from check_colab import set_datapath
colab, datapath = set_datapath(local_folder)

10. Convolutions and rescaling

In order to learn the basics of neural networks as well as of higher-level DL packages, we have so far used simple neural nets consisting only of linear (fully connected) layers and activations. The information in images, however, has a specific structure which calls for other types of layers. In particular, convolution plays a central role in this area.

From global to local information

When we linearize an image to pass it through a linear layer, the underlying assumption is that all pixels connected to a given activation in a layer are equivalent to each other. This is obviously an oversimplification: in most cases, single pixels have a local context that gives them meaning, and e.g. a pixel in the upper left corner of an image is not much related to one in the lower left. Of course all pixels of an image of the sea are still related to each other in the sense that they belong to waves, are blue etc., while those of an image of a forest belong to leaves, are green etc. This global connection can also be taken into account by looking at coarse-grained versions of the image.

Convolution filters as well as rescaling operations can specifically recover the types of information mentioned above: convolutions recover local information, while rescaling allows us to do that at different scales.

Convolution

A convolution is simply image filtering: a small image, the filter \(f\), travels across the actual image \(I\) and performs a local multiply-and-sum operation at each pixel, generating a filtered image \(F\). At pixel \(a,b\) in the image, we compute \(F_{a,b} = \sum_{i=1}^{N}\sum_{j=1}^{M}f_{i,j}I_{a+i,b+j}\), where \(N,M\) are the filter dimensions. In other words, we generate a new image via a local operation at each pixel. Below you can find an illustration of this with a uniform filter, so that the operation captures the local mean of the image. The result is a smoothing out of sharp edges.
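The multiply-and-sum above can be sketched with a plain double loop (a minimal NumPy sketch of ours; border pixels where the filter does not fit are simply skipped):

```python
import numpy as np

def filter_image(image, filt):
    """Slide a filter across an image and multiply-and-sum at each position.

    No padding is used, so the output is smaller than the input.
    """
    n, m = filt.shape
    h, w = image.shape
    out = np.zeros((h - n + 1, w - m + 1))
    for a in range(out.shape[0]):
        for b in range(out.shape[1]):
            out[a, b] = np.sum(filt * image[a:a + n, b:b + m])
    return out

# a uniform 3x3 filter computes the local mean, which smooths the image
image = np.zeros((6, 6))
image[:, 3:] = 9.0  # sharp vertical edge
mean_filter = np.ones((3, 3)) / 9
smoothed = filter_image(image, mean_filter)
print(smoothed)  # the hard 0-to-9 jump becomes a gradual 0, 3, 9 ramp
```

Note that the 6x6 input shrinks to a 4x4 output, a point we return to below when discussing output dimensions.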

HTML(url='https://raw.githubusercontent.com/guiwitz/DLImaging/master/illustrations/convol_mean.html')

As illustrated below, a convolution can be seen as a local linear filter. This should make it clear that for convolutions the weights are the filters themselves.

Convolution layer

Usually when we integrate a convolution layer in a neural network, we don’t use a single filter but an entire series. The result is a series of filtered images stacked together in a “volume”. Those are the “cubes” one can typically see in schematics of neural networks.

Each filter detects specific features. We have seen a mean filter above, but other filters can detect many other features in an image. For example, a filter with vertical stripes of values -1, 0, 1 (see below) will detect edges in an image:

HTML(url='https://raw.githubusercontent.com/guiwitz/DLImaging/master/illustrations/convol_edge.html')
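We can try this edge filter ourselves by writing the stripe values directly into the weights of a Conv2d layer (a minimal sketch; the tiny test image is our own synthetic example):

```python
import torch
import torch.nn as nn

# vertical stripes of -1, 0, 1, as in the illustration above
edge_filter = torch.tensor([[-1., 0., 1.],
                            [-1., 0., 1.],
                            [-1., 0., 1.]])

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
with torch.no_grad():
    conv.weight[:] = edge_filter  # broadcast into the (1, 1, 3, 3) weight

# a synthetic image with a sharp vertical edge in the middle
image = torch.zeros(1, 1, 8, 8)
image[..., 4:] = 1.0

response = conv(image)
# the response is non-zero only where the filter straddles the edge
print(response[0, 0])
```

Columns of the output far from the edge are zero, while the two columns straddling it light up, which is exactly the edge-detection behaviour described above.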

Training convolution layers

In classic image processing or even in Machine Learning, designing filters is an important part of an algorithm. In Deep Learning, one skips this design step entirely and has the network learn the filters. Just like we optimize the weights (or parameters) in a linear layer, here we optimize the filters, which are composed of the weights. After training we might recover some known filters, but they are not enforced. Just like in a linear layer, the filters are randomly initialized (following some rule).

Basic convolution layer

We now create a simple layer to see how it works and what options we have:

import torch
import torch.nn as nn
conv_filter = nn.Conv2d()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [5], in <module>
----> 1 conv_filter = nn.Conv2d()

TypeError: __init__() missing 3 required positional arguments: 'in_channels', 'out_channels', and 'kernel_size'

We see that we need three pieces of information:

  • in_channels: the number of input images. For a black and white image this is 1, for an RGB image 3, etc. Deeper in the network, this input size is the output size of the previous layer, which can be arbitrary

  • out_channels: the number of different filters we want to apply to the image, each producing a filtered image

  • kernel_size: the size of the filter. In the above examples that would be 3 as our filters covered a 3x3 pixel area

So let’s try again:

conv_filter = nn.Conv2d(in_channels=1, out_channels=5, kernel_size=3)
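Since for convolutions the weights are the filters themselves, we can inspect them directly (the layer is recreated here so the snippet is self-contained):

```python
import torch.nn as nn

conv_filter = nn.Conv2d(in_channels=1, out_channels=5, kernel_size=3)

# the weights are the 5 randomly initialized filters,
# stored as (out_channels, in_channels, kernel, kernel)
print(conv_filter.weight.shape)  # torch.Size([5, 1, 3, 3])
# one bias value per output channel
print(conv_filter.bias.shape)    # torch.Size([5])
```

During training these tensors are updated by backpropagation, just like the weights of a linear layer.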

Now we can use this module like any other one. We first load an image from the drawing dataset:

import numpy as np
import matplotlib.pyplot as plt

images = np.load('../data/quickdraw/full_numpy_bitmap_violin.npy')
image = torch.tensor(np.reshape(images[0], (28,28)), dtype=torch.float32)
plt.imshow(image);
../_images/10-Convolutions_draw_17_0.png
output = conv_filter(image)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [9], in <module>
----> 1 output = conv_filter(image)

File ~/miniconda3/envs/CASImaging/lib/python3.9/site-packages/torch/nn/modules/module.py:1102, in Module._call_impl(self, *input, **kwargs)
   1098 # If we don't have any hooks, we want to skip the rest of the logic in
   1099 # this function, and just call forward.
   1100 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1101         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102     return forward_call(*input, **kwargs)
   1103 # Do not call functions when jit is used
   1104 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/CASImaging/lib/python3.9/site-packages/torch/nn/modules/conv.py:446, in Conv2d.forward(self, input)
    445 def forward(self, input: Tensor) -> Tensor:
--> 446     return self._conv_forward(input, self.weight, self.bias)

File ~/miniconda3/envs/CASImaging/lib/python3.9/site-packages/torch/nn/modules/conv.py:442, in Conv2d._conv_forward(self, input, weight, bias)
    438 if self.padding_mode != 'zeros':
    439     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
    440                     weight, bias, self.stride,
    441                     _pair(0), self.dilation, self.groups)
--> 442 return F.conv2d(input, weight, bias, self.stride,
    443                 self.padding, self.dilation, self.groups)

RuntimeError: Expected 4-dimensional input for 4-dimensional weight [5, 1, 3, 3], but got 2-dimensional input of size [28, 28] instead

We passed a 2D tensor but a 4D one is expected. This is the same problem we have faced before: the first dimension is the batch size, the second one the channels (e.g. 3 for RGB), and the last two the image dimensions. Here we have a single image, so we have to add two dimensions of size 1 to it:

image_more_dims = image.unsqueeze(dim=0).unsqueeze(dim=0)
output = conv_filter(image_more_dims)
output.size()
torch.Size([1, 5, 26, 26])

Our output now has 5 channels or 5 features, each being the original image filtered by a different filter:

fig, ax = plt.subplots(1,5, figsize=(15,15))
for i in range(5):
    ax[i].imshow(output[0,i,:,:].detach().numpy())
../_images/10-Convolutions_draw_23_0.png

This could be the first layer of a convolutional neural network. Now, just like after a linear layer, we can add a non-linearity, e.g. a ReLU layer:

relu_layer = nn.ReLU()
output2 = relu_layer(output)
output2.shape
torch.Size([1, 5, 26, 26])
fig, ax = plt.subplots(1,5, figsize=(15,15))
for i in range(5):
    ax[i].imshow(output2[0,i,:,:].detach().numpy())
../_images/10-Convolutions_draw_27_0.png

Finally we can add a new convolution layer, this time with in_channels=5. With out_channels=1 we recover a single image, which is something we want e.g. in segmentation tasks:

conv_filter2 = nn.Conv2d(in_channels=5, out_channels=1, kernel_size=3)
output3 = conv_filter2(output2)
fig, ax = plt.subplots(figsize=(5,5))
ax.imshow(output3[0,0,:,:].detach().numpy());
../_images/10-Convolutions_draw_31_0.png

Output dimensions

Maybe you have noticed that the image dimensions have changed through our layers:

image_more_dims.size()
torch.Size([1, 1, 28, 28])
output.size()
torch.Size([1, 5, 26, 26])

What happened here? As you could see in the interactive filtering animation above, when the filter is centered on a pixel at the image border, part of the filter lies outside the image. Here we have two choices:

  • don’t calculate the filtering for edge pixels

  • pad the image with the necessary number of rows and columns, filled with some values, so that the output size remains the same as the input size

Obviously the former happened here: our filter is 3x3, meaning that we lose a 1 pixel wide border around the image. Since we lose one pixel on each side (top/bottom and left/right), we end up with \(28-1-1=26\). To avoid having to juggle with changing dimensions, it is simpler to use padding, e.g.:

conv_filter_pad = nn.Conv2d(in_channels=1, out_channels=5, kernel_size=3, padding=1)
output_pad = conv_filter_pad(image_more_dims)
output_pad.size()
torch.Size([1, 5, 28, 28])
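In general, the spatial output size follows the standard formula \(\lfloor(\text{in} + 2\,\text{padding} - \text{kernel})/\text{stride}\rfloor + 1\). A small helper (the function name is ours) confirms the sizes seen above:

```python
def conv_output_size(in_size, kernel_size, padding=0, stride=1):
    # standard formula for the spatial output size of a convolution
    return (in_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, kernel_size=3))             # 26: one pixel lost per side
print(conv_output_size(28, kernel_size=3, padding=1))  # 28: size preserved
```

The same formula also predicts the effect of a stride, which we use in the next section.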

Re-scaling

In order to use both the small-scale details of an image (e.g. a person's eyes) as well as larger scales (e.g. a person's head), we need to be able to apply convolutions at different scales. The solution is to successively reduce the image size between convolution layers. There are multiple ways of doing this:

  • max pooling: group pixels e.g. 2x2 (4 pixels) and replace them by their maximum value

  • average pooling: group pixels e.g. 2x2 (4 pixels) and replace them by their mean value

  • change the convolution stride: in the above examples the filter moved across the image in steps of 1 pixel. We can also change this and have it move e.g. in steps - or with a stride - of 2, which rescales the image by a factor of 2.

We will see that sometimes we even need to do the opposite: make a feature map that has been downscaled larger again. There are mainly two ways of doing this, one purely deterministic and one involving learned parameters:

  • Upsampling: using some approximation to add new pixels in an image. One can e.g. just replicate pixels in a larger neighbourhood or calculate an interpolation between pixels

  • Transpose convolution (deconvolution): this is the reverse of a convolution. It can be difficult to properly understand what a deconvolution is (see here for a great visual demo). The idea is to distribute pixel values over a larger area using a trained kernel that does the distribution.
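The deterministic upsampling option can be tried directly with nn.Upsample (a minimal sketch on a tiny tensor of ours):

```python
import torch
import torch.nn as nn

small = torch.tensor([[1., 2.],
                      [3., 4.]]).reshape(1, 1, 2, 2)

# nearest-neighbour upsampling simply replicates each pixel
upsample_nearest = nn.Upsample(scale_factor=2, mode='nearest')
print(upsample_nearest(small)[0, 0])
# tensor([[1., 1., 2., 2.],
#         [1., 1., 2., 2.],
#         [3., 3., 4., 4.],
#         [3., 3., 4., 4.]])

# bilinear upsampling instead interpolates between neighbouring pixels
upsample_bilinear = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
print(upsample_bilinear(small)[0, 0])
```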

Stride

We first illustrate with a convolution layer with a stride of 2:

conv_filter_stride = nn.Conv2d(in_channels=1, out_channels=5, kernel_size=3, padding=1, stride=2)
output_stride = conv_filter_stride(image_more_dims)
output_stride.size()
torch.Size([1, 5, 14, 14])
fig, ax = plt.subplots(1,2)
ax[0].imshow(image_more_dims[0,0])
ax[1].imshow(output_stride[0,0].detach())
<matplotlib.image.AxesImage at 0x13dd45130>
../_images/10-Convolutions_draw_43_1.png

Pooling

For average and max pooling, we need to decide the size of the pixel group that we want to combine:

av_pool = nn.AvgPool2d(kernel_size=2)
max_pool = nn.MaxPool2d(kernel_size=4)
output_avpool = av_pool(image_more_dims)
output_maxpool = max_pool(image_more_dims)
print(f'size of average pool output for 2x2 groups: {output_avpool.size()}')
print(f'size of max pool output for 4x4 groups: {output_maxpool.size()}')
size of average pool output for 2x2 groups: torch.Size([1, 1, 14, 14])
size of max pool output for 4x4 groups: torch.Size([1, 1, 7, 7])
fig, ax = plt.subplots(1,3)
ax[0].imshow(image_more_dims[0,0])
ax[0].set_title('raw')
ax[1].imshow(output_avpool[0,0].detach())
ax[1].set_title('Average pool 2x2')
ax[2].imshow(output_maxpool[0,0].detach())
ax[2].set_title('Max pool 4x4');
../_images/10-Convolutions_draw_47_0.png

Transpose convolution

The nn.ConvTranspose2d layer is really designed to “reverse” the effect of a convolution layer. So if we use the same parameters, we can simply recover the original image shape.

conv_filter_stride = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2, padding=0, stride=2)

transpose_conv_filter_stride = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=2, stride=2)
output_stride = conv_filter_stride(image_more_dims)
output_tr_conv = transpose_conv_filter_stride(output_stride)
image_more_dims.size()
torch.Size([1, 1, 28, 28])
output_stride.size()
torch.Size([1, 1, 14, 14])
output_tr_conv.size()
torch.Size([1, 1, 28, 28])
plt.imshow(output_tr_conv[0,0].detach())
<matplotlib.image.AxesImage at 0x13dbda790>
../_images/10-Convolutions_draw_54_1.png
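The size recovery above follows from the transpose convolution size formula, which runs the convolution formula in reverse: \(\text{out} = (\text{in} - 1)\,\text{stride} - 2\,\text{padding} + \text{kernel}\). A quick check (the helper name is ours; output_padding is omitted for simplicity):

```python
def transpose_conv_output_size(in_size, kernel_size, padding=0, stride=1):
    # inverse of the convolution size formula (output_padding omitted)
    return (in_size - 1) * stride - 2 * padding + kernel_size

# kernel_size=2, stride=2 maps the 14x14 feature map back to 28x28,
# undoing the strided convolution used above
print(transpose_conv_output_size(14, kernel_size=2, stride=2))  # 28
```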

A real convolutional network

As a first example of a network including convolutions we will continue with our drawing dataset. First we again create our dataloader, which loads images from npy files on disk:

import pytorch_lightning as pl
from torch.utils.data import Dataset, DataLoader, random_split
from torch.functional import F
from torch import nn
import torch
import numpy as np

Dataset

For the dataset, we use the same approach as previously.

from torchvision import transforms

transformations = transforms.Compose([
    transforms.ToTensor(),
])

class Drawings(Dataset):
    def __init__(self, data, targets, transform=None):
        self.data = data
        self.targets = torch.LongTensor(targets)
        self.transform = transform

    def __getitem__(self, index):
        x = self.data[index]
        x = np.reshape(x, (28,28))
        y = self.targets[index]

        if self.transform:
            x = self.transform(x)

        return x, y

    def __len__(self):
        return len(self.targets)
num_data = 10000
batch_size = 10

folders = list(datapath.joinpath('data/quickdraw').glob('*npy'))
label_dict = {i:f.name.split('_')[-1][:-4] for i, f in enumerate(folders)}

data = np.concatenate([np.load(f)[0:num_data] for f in folders]) #check everything works with tiny set
labels = np.concatenate([[ind for i in range(num_data)] for ind, f in enumerate(folders)]) #check everything works with tiny set

drawings = Drawings(data, labels, transformations)

train_size = int(0.8 * len(drawings))
valid_size = len(drawings)-train_size

train_data, valid_data = random_split(drawings, [train_size, valid_size])

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
validation_loader = DataLoader(valid_data, batch_size=batch_size, shuffle=True)

Network creation

We start here by adding a single convolution layer (followed by max pooling) at the very beginning of the network, in front of the final linear layer.

class Mynetwork(pl.LightningModule):
    def __init__(self, num_categories, im_size):
        super(Mynetwork, self).__init__()
        
        self.im_size = im_size
        self.final_size = self.im_size // 2
        
        # define the layers here, e.g.
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=20, kernel_size=3, padding=1)
        self.maxpool1 = nn.MaxPool2d(kernel_size=2)
        #self.conv2 = nn.Conv2d(in_channels=20, out_channels=20, kernel_size=3, padding=1)
        #self.maxpool2 = nn.MaxPool2d(kernel_size=2)
        self.linear = nn.Linear(20*self.final_size*self.final_size, num_categories)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        
        # define the sequence of operations in the network including e.g. activations
        x = F.relu(self.conv1(x))
        x = self.maxpool1(x)
        #x = F.relu(self.conv2(x))
        #x = self.maxpool2(x)
        x = x.flatten(start_dim=1)
        x = self.linear(x)
                
        return x
    
    def training_step(self, batch, batch_idx):
        
        x, y = batch
        output = self(x)
        loss = self.loss(output, y)
        
        self.log('loss', loss, on_epoch=True, prog_bar=True, logger=True)

        return loss
    
    def validation_step(self, batch, batch_idx):
        
        x, y = batch
        output = self(x)
        accuracy = (torch.argmax(output,dim=1) == y).sum()/len(y)

        self.log('accuracy', accuracy, on_epoch=True, prog_bar=True, logger=True)
        
        return accuracy
        
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
model = Mynetwork(num_categories=3, im_size=28)

Before we train, let’s see if we set all dimensions correctly:

model(image_more_dims)
tensor([[  2.7191, -27.9978,  27.0940]], grad_fn=<AddmmBackward0>)
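Before training, we can also verify the parameter count by hand (a small sketch; the layer sizes are copied from the network definition above, with final_size = 28 // 2 = 14):

```python
import torch.nn as nn

# recreate the two trainable layers of the model to count their parameters
conv1 = nn.Conv2d(in_channels=1, out_channels=20, kernel_size=3, padding=1)
linear = nn.Linear(20 * 14 * 14, 3)

# conv1: 20 filters of 1x3x3 weights plus one bias each
n_conv = sum(p.numel() for p in conv1.parameters())
print(n_conv)    # 20*1*3*3 + 20 = 200

# linear: 3920 inputs x 3 outputs plus 3 biases
n_linear = sum(p.numel() for p in linear.parameters())
print(n_linear)  # 3920*3 + 3 = 11763
```

These numbers match the parameter summary that Lightning prints when training starts.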

Training

Finally, we train our network as usual with Lightning:

trainer = pl.Trainer(max_epochs=2)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=validation_loader)
  | Name     | Type             | Params
----------------------------------------------
0 | conv1    | Conv2d           | 200   
1 | maxpool1 | MaxPool2d        | 0     
2 | linear   | Linear           | 11.8 K
3 | loss     | CrossEntropyLoss | 0     
----------------------------------------------
12.0 K    Trainable params
0         Non-trainable params
12.0 K    Total params
0.048     Total estimated model params size (MB)
                                                              
/Users/gw18g940/miniconda3/envs/CASImaging/lib/python3.9/site-packages/pytorch_lightning/trainer/data_loading.py:659: UserWarning: Your `val_dataloader` has `shuffle=True`, it is strongly recommended that you turn this off for val/test/predict dataloaders.
  rank_zero_warn(
/Users/gw18g940/miniconda3/envs/CASImaging/lib/python3.9/site-packages/pytorch_lightning/trainer/data_loading.py:132: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/Users/gw18g940/miniconda3/envs/CASImaging/lib/python3.9/site-packages/pytorch_lightning/trainer/data_loading.py:132: UserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:  49%|████▉     | 1471/3000 [00:36<00:38, 40.21it/s, loss=0.187, v_num=3, loss_step=0.227] 
/Users/gw18g940/miniconda3/envs/CASImaging/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:688: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")

Verify the output

mybatch, mylabel = next(iter(validation_loader))
pred = model(mybatch)
pred = pred.argmax(dim=1)
pred == mylabel
tensor([False,  True,  True,  True,  True,  True,  True,  True,  True,  True])
fig, ax = plt.subplots(1,10, figsize=(10,2))
for ind, im in enumerate(mybatch):
    ax[ind].imshow(im[0])
    title = label_dict[mylabel[ind].item()] + '\n' + label_dict[pred[ind].item()]
    ax[ind].set_title(title)
Epoch 0:  49%|████▉     | 1471/3000 [00:49<00:51, 29.43it/s, loss=0.187, v_num=3, loss_step=0.227]
../_images/10-Convolutions_draw_72_1.png

Interpretation

It seems that we have managed to train our network to classify our drawings. But as is often the case, we have no idea how our model “decides” the category.

There are however tools that allow one to get some insight into the inner workings of the network. We don't explore this topic in depth here, but give an example to demonstrate this aspect of working with deep learning.

We will use here the concept of saliency, which reflects the importance of each pixel in the classification decision. The principle is the following:

  • for a given image, find the largest activation in the last layer, i.e. the one that predicts the class

  • instead of calculating the gradients of the loss function, calculate the gradients of that maximum activation by backpropagation

  • find the gradients with respect to the input, i.e. find which pixels would most change the prediction if they changed

We can do all this manually. First we include the input in the gradient calculation:

ind=0
myinput = mybatch[ind].unsqueeze(0)
myinput.requires_grad = True
plt.imshow(myinput[0,0].detach());
../_images/10-Convolutions_draw_75_0.png

Find the maximum activation

scores = model(myinput)
score_max_index = scores.argmax()
score_max = scores[0,score_max_index]

Calculate the gradients by backpropagation:

score_max.backward()

Find the gradient with respect to the input and take its absolute value (largest influence):

saliency = myinput.grad.data.abs().squeeze()

The output is a map indicating the importance of each pixel. We can look at it alone or superpose it on the original image:

fig, ax = plt.subplots(1,2)
ax[0].imshow(myinput[0,0].detach(), cmap='gray')
ax[0].imshow(saliency[:,:], cmap='Reds', alpha = 0.5)
ax[1].imshow(saliency[:,:], cmap='Reds')
<matplotlib.image.AxesImage at 0x13f94ee50>
../_images/10-Convolutions_draw_83_1.png

What we did above can be automated with packages dedicated to this kind of interpretation. Captum is one such package, offering many more attribution methods than just saliency. The latter would be produced with the following code:

from captum.attr import Saliency
from captum.attr import visualization as viz
saliency = Saliency(model)
myinput.requires_grad = True
grads = saliency.attribute(myinput, target=mylabel[ind].item())
grads = grads.squeeze().cpu().detach().numpy()
original_image = myinput.detach().numpy()[0]
plt.imshow(grads, cmap='Reds');
../_images/10-Convolutions_draw_87_0.png

Exercises

  1. Try to train the same network on the second dataset quickdraw_alt. What do you observe?

  2. Try to add an additional layer of convolution + max-pooling layer. Make sure all your inputs/outputs have the correct size

  3. Does that help to improve classification of the second dataset?